Marketing Analytics Process

Predictive Modeling Workflow

Supervised Learning and Prediction

We’ve already spent time with supervised learning: modeling with an outcome variable. Specifically, we dealt with regression and classification.

We used supervised learning for inference (i.e., to understand the underlying data generating process), but now we care about prediction. Instead of letting our theories about the data generating process drive variable selection and worrying about whether our coefficient estimates are accurate, we’ll need to run a lot of models and find which one predicts best.

Import Data

Let’s import and work with some new data.

# Load packages.
library(tidyverse)
library(tidymodels)

# Set a simulation seed.
set.seed(42)

In this unit, we’ll be examining survey data from iRobot (Roomba).

roomba_survey <- read_csv(here::here("Data", "roomba_survey.csv"))

roomba_survey
## # A tibble: 332 × 128
##    sys_RespNum sys_StartTime sys_EndTime sys_LastQuestion sys_CBC_CBC1_design   
##          <dbl>         <dbl>       <dbl> <chr>            <chr>                 
##  1           2    1456893467  1456893958 Finished         [[1,1,1,2,1,2,2,1],[2…
##  2           3    1456893643  1456893998 Finished         [[1,4,2,1,2,1,1,1],[2…
##  3           4    1456893769  1456893998 Finished         [[1,3,2,2,1,2,1,4],[2…
##  4           9    1456895699  1456904874 Finished         [[1,2,2,1,1,1,1,3],[2…
##  5          23    1456924935  1456925731 Finished         [[1,1,1,2,1,2,1,4],[2…
##  6          24    1456930656  1456932188 Finished         [[1,3,2,2,1,1,1,2],[2…
##  7          28    1456943719  1456943970 Finished         [[1,4,2,1,2,1,1,1],[2…
##  8          30    1456945961  1456946585 Finished         [[1,4,1,1,1,1,2,2],[2…
##  9          31    1456946554  1456946910 Finished         [[1,1,2,2,2,1,2,2],[2…
## 10          33    1456946838  1456947219 Finished         [[1,1,1,1,2,1,2,3],[2…
## # ℹ 322 more rows
## # ℹ 123 more variables: sys_CBC_CBC1_design_info <chr>, S1 <dbl>, S1A <dbl>,
## #   S1B <dbl>, S1C <dbl>, S1C_9_other <chr>, S2 <dbl>, S3Age <dbl>,
## #   S4Income <dbl>, CleaningAttitudes_1 <dbl>, CleaningAttitudes_2 <dbl>,
## #   CleaningAttitudes_3 <dbl>, CleaningAttitudes_4 <dbl>,
## #   CleaningAttitudes_5 <dbl>, CleaningAttitudes_6 <dbl>,
## #   CleaningAttitudes_7 <dbl>, CleaningAttitudes_8 <dbl>, …

# Answers to S1? This is Q1 in the survey dictionary, i.e., the first screening question.
roomba_survey |> 
  count(S1)
## # A tibble: 3 × 2
##      S1     n
##   <dbl> <int>
## 1     1    40
## 2     3    63
## 3     4   229

Outcome Variable

Going forward, we will perform feature engineering on our outcome variable before anything else: pre-process the outcome first, then split the data, then pre-process the training and testing data.

# Wrangle S1 into segment.
roomba_survey <- roomba_survey |> 
  rename(segment = S1) |> 
  mutate(
    # easier way to do multiple if-else statements!
    segment = case_when(
      segment == 1 ~ "own",
      segment == 3 ~ "shopping",
      segment == 4 ~ "considering"
    ),
    segment = factor(segment)
  )

Data Splitting by Strata

Once again, one of the first things we need to do is split the data. To ensure the testing data includes every category of our outcome variable, use the strata argument.

# Split data based on segment.
roomba_split <- initial_split(roomba_survey, prop = 0.75, strata = segment)

How could you check and see that this worked? (Hint: when using the strata argument, the splitting proportion is applied to each stratum separately).
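One way to answer this (a sketch, assuming the split above has been run): count each segment in the training and testing sets and compare proportions. With strata, the proportions should be nearly identical across the two sets.

```r
# Compare segment proportions across the training and testing sets.
training(roomba_split) |> 
  count(segment) |> 
  mutate(prop = n / sum(n))

testing(roomba_split) |> 
  count(segment) |> 
  mutate(prop = n / sum(n))
```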

Decision Trees

We might be tempted to use a logistic regression since this is a classification problem, but why wouldn’t it work for this outcome? (Hint: segment has three categories, and plain logistic regression predicts only two.)

Let’s use a decision tree.

  • Instead of fitting a line, split the data based on a decision rule using one of the predictors.
  • Keep adding decision rules based on more predictors to split the data further.
  • Use the resulting regions to classify (or minimize the residual sum of squares if it’s regression).

Clear as mud?
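Maybe a toy example helps. A single decision rule is just an if-else on one predictor; the data below and the cutoff of 3.5 are made up purely for illustration.

```r
# Hypothetical data: one attitude score and the segment we'd like to predict.
toy <- tibble(
  attitude = c(1, 2, 4, 5, 3, 5),
  segment  = c("considering", "considering", "shopping", "shopping", "considering", "shopping")
)

# One decision rule, written out by hand: split on attitude at 3.5.
toy |> 
  mutate(prediction = if_else(attitude >= 3.5, "shopping", "considering"))
```

A decision tree does exactly this, except it picks the predictor and the cutoff for us, then keeps splitting each resulting region further.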

Set the model, engine, and mode (since a decision tree can do regression or classification).

# Set the model, engine, and mode.
roomba_model <- decision_tree() |> 
  set_engine(engine = "rpart") |> 
  set_mode("classification")

Without fit(), the setup here is just a list of instructions. Where have we seen this before?

roomba_model
## Decision Tree Model Specification (classification)
## 
## Computational engine: rpart

Workflows

A workflow is a tidymodels object that combines the instructions of a recipe and a model.

# Create a workflow.
roomba_wf_01 <- workflow() |> 
  add_formula(
    segment ~ CleaningAttitudes_1 + CleaningAttitudes_2 + CleaningAttitudes_3 + 
      CleaningAttitudes_4 + CleaningAttitudes_5 + CleaningAttitudes_6 + 
      CleaningAttitudes_7 + CleaningAttitudes_8 + CleaningAttitudes_9 + 
      CleaningAttitudes_10 + CleaningAttitudes_11
  ) |> 
  add_model(roomba_model)

roomba_wf_01
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Formula
## Model: decision_tree()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## segment ~ CleaningAttitudes_1 + CleaningAttitudes_2 + CleaningAttitudes_3 + 
##     CleaningAttitudes_4 + CleaningAttitudes_5 + CleaningAttitudes_6 + 
##     CleaningAttitudes_7 + CleaningAttitudes_8 + CleaningAttitudes_9 + 
##     CleaningAttitudes_10 + CleaningAttitudes_11
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Decision Tree Model Specification (classification)
## 
## Computational engine: rpart

Fit a Workflow

We can fit the workflow itself since it includes the formula and model instructions.

# Fit a workflow.
wf_fit_01 <- fit(roomba_wf_01, data = training(roomba_split))

wf_fit_01
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Formula
## Model: decision_tree()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## segment ~ CleaningAttitudes_1 + CleaningAttitudes_2 + CleaningAttitudes_3 + 
##     CleaningAttitudes_4 + CleaningAttitudes_5 + CleaningAttitudes_6 + 
##     CleaningAttitudes_7 + CleaningAttitudes_8 + CleaningAttitudes_9 + 
##     CleaningAttitudes_10 + CleaningAttitudes_11
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## n= 248 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 248 77 considering (0.6895161 0.1209677 0.1895161) *

Evaluate Predictive Fit

Similarly, we can evaluate predictive fit using the fitted workflow. For classification, there are a lot of possible measures of predictive fit, but accuracy is a natural one to use.

# Compute model accuracy.
wf_fit_01 |> 
  predict(new_data = testing(roomba_split)) |>
  bind_cols(testing(roomba_split)) |>
  accuracy(truth = segment, estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.690
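If we want more than accuracy, yardstick lets us bundle several metrics together with metric_set(). A sketch, here adding Cohen’s kappa (kap) alongside accuracy:

```r
# Compute several classification metrics at once.
class_metrics <- metric_set(accuracy, kap)

wf_fit_01 |> 
  predict(new_data = testing(roomba_split)) |>
  bind_cols(testing(roomba_split)) |>
  class_metrics(truth = segment, estimate = .pred_class)
```

Kappa is useful here because it adjusts for how often we’d be right just by always guessing the most common class.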

We can also compute a confusion matrix.

# Compute a confusion matrix.
wf_fit_01 |>
  predict(new_data = testing(roomba_split)) |>
  bind_cols(testing(roomba_split)) |>
  conf_mat(truth = segment, estimate = .pred_class)
##              Truth
## Prediction    considering own shopping
##   considering          58  10       16
##   own                   0   0        0
##   shopping              0   0        0

Feature Engineering

We can certainly do better. What if we included some demographic predictors instead? We’ll still want to dummy code them.

# Build a recipe.
roomba_recipe <- training(roomba_split) |>
  recipe(
    segment ~ D1Gender + D2HomeType + D3Neighborhood + D4MaritalStatus
  ) |>
  step_dummy(all_nominal(), -all_outcomes())

Note that we haven’t used prep() – that’s part of executing the workflow now.
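That said, if you want to inspect the pre-processed data yourself, you can still run prep() and bake() manually. A sketch: bake(new_data = NULL) returns the prepped training data.

```r
# Inspect the pre-processed training data, including the new dummy columns.
roomba_recipe |> 
  prep() |> 
  bake(new_data = NULL)
```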

Update, Fit, and Evaluate

We didn’t have a recipe() before, so we had to use add_formula(). Let’s remove the formula and add our new recipe instead.

# Update the workflow.
roomba_wf_02 <- roomba_wf_01 |> 
  remove_formula() |>
  add_recipe(roomba_recipe)

roomba_wf_02
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: decision_tree()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 1 Recipe Step
## 
## • step_dummy()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Decision Tree Model Specification (classification)
## 
## Computational engine: rpart

Fitting a workflow that includes a recipe executes prep(), bake(), and fit() in one step.

# Fit a second workflow.
wf_fit_02 <- fit(roomba_wf_02, data = training(roomba_split))

wf_fit_02
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: decision_tree()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 1 Recipe Step
## 
## • step_dummy()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## n= 248 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 248 77 considering (0.6895161 0.1209677 0.1895161) *

Predicting with a fitted workflow likewise runs bake() and predict() in one step.

# Compute model accuracy.
wf_fit_02 |> 
  predict(new_data = testing(roomba_split)) |>
  bind_cols(testing(roomba_split)) |>
  accuracy(truth = segment, estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.690

# Compute a confusion matrix.
wf_fit_02 |>
  predict(new_data = testing(roomba_split)) |>
  bind_cols(testing(roomba_split)) |>
  conf_mat(truth = segment, estimate = .pred_class)
##              Truth
## Prediction    considering own shopping
##   considering          58  10       16
##   own                   0   0        0
##   shopping              0   0        0
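To see what the workflow is doing for us, here’s a sketch of the same prediction done by hand: extract the prepped recipe, bake the testing data, then predict with the extracted parsnip fit. (extract_recipe() and extract_fit_parsnip() come from the workflows package loaded with tidymodels.)

```r
# Bake the testing data with the trained recipe...
baked_test <- wf_fit_02 |> 
  extract_recipe() |> 
  bake(new_data = testing(roomba_split))

# ...then predict with the underlying fitted model.
wf_fit_02 |> 
  extract_fit_parsnip() |> 
  predict(new_data = baked_test)
```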

Predict using Cleaning Attitudes and Demographics

# Build a new recipe.
roomba_recipe_full <- training(roomba_split) |>
  recipe(
    segment ~ D1Gender + D2HomeType + D3Neighborhood + D4MaritalStatus +
      CleaningAttitudes_1 + CleaningAttitudes_2 + CleaningAttitudes_3 + 
      CleaningAttitudes_4 + CleaningAttitudes_5 + CleaningAttitudes_6 + 
      CleaningAttitudes_7 + CleaningAttitudes_8 + CleaningAttitudes_9 + 
      CleaningAttitudes_10 + CleaningAttitudes_11
  ) |>
  step_dummy(all_nominal(), -all_outcomes())

Update the Workflow Again

Notice we are updating the recipe this time.

# Update the workflow.
roomba_wf_03 <- roomba_wf_02 |> 
  remove_recipe() |>
  add_recipe(roomba_recipe_full)

roomba_wf_03
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: decision_tree()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 1 Recipe Step
## 
## • step_dummy()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Decision Tree Model Specification (classification)
## 
## Computational engine: rpart

Fit the Workflow

# Fit the full workflow.
wf_fit_03 <- fit(roomba_wf_03, data = training(roomba_split))

wf_fit_03
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: decision_tree()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 1 Recipe Step
## 
## • step_dummy()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## n= 248 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 248 77 considering (0.68951613 0.12096774 0.18951613)  
##    2) CleaningAttitudes_10< 0.5 115 28 considering (0.75652174 0.10434783 0.13913043) *
##    3) CleaningAttitudes_10>=0.5 133 49 considering (0.63157895 0.13533835 0.23308271)  
##      6) CleaningAttitudes_4< 0.5 63 20 considering (0.68253968 0.19047619 0.12698413) *
##      7) CleaningAttitudes_4>=0.5 70 29 considering (0.58571429 0.08571429 0.32857143)  
##       14) D1Gender< 1.5 51 16 considering (0.68627451 0.05882353 0.25490196) *
##       15) D1Gender>=1.5 19  9 shopping (0.31578947 0.15789474 0.52631579) *

Evaluate Predictions from the Full Model

# Compute model accuracy.
wf_fit_03 |> 
  predict(new_data = testing(roomba_split)) |>
  bind_cols(testing(roomba_split)) |>
  accuracy(truth = segment, estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.702

Confusion Matrix for the Full Model

# Compute a confusion matrix.
wf_fit_03 |>
  predict(new_data = testing(roomba_split)) |>
  bind_cols(testing(roomba_split)) |>
  conf_mat(truth = segment, estimate = .pred_class)
##              Truth
## Prediction    considering own shopping
##   considering          56   9       13
##   own                   0   0        0
##   shopping              2   1        3

Notes on Growing (Training/Fitting) a Decision Tree

Workflows help because we’re iterating on a lot of models as we try different predictors and changes to feature engineering. But we haven’t said anything yet about hyperparameters.

  • tree_depth: the maximum depth of the tree (default is 30)
  • min_n: the minimum number of data points a node must contain to be split (default is 20)

# Specify the model, engine, and mode.
roomba_model <- decision_tree(tree_depth = 2) |> 
  set_engine(engine = "rpart") |> 
  set_mode("classification")

roomba_model
## Decision Tree Model Specification (classification)
## 
## Main Arguments:
##   tree_depth = 2
## 
## Computational engine: rpart

# Update the workflow.
roomba_wf_03 <- roomba_wf_03 |> 
  update_model(roomba_model)

roomba_wf_03
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: decision_tree()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 1 Recipe Step
## 
## • step_dummy()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Decision Tree Model Specification (classification)
## 
## Main Arguments:
##   tree_depth = 2
## 
## Computational engine: rpart

# Fit the workflow.
wf_fit_03 <- fit(roomba_wf_03, data = training(roomba_split))

wf_fit_03
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: decision_tree()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 1 Recipe Step
## 
## • step_dummy()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## n= 248 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 248 77 considering (0.6895161 0.1209677 0.1895161) *

# Compute model accuracy.
wf_fit_03 |> 
  predict(new_data = testing(roomba_split)) |>
  bind_cols(testing(roomba_split)) |>
  accuracy(truth = segment, estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.690

# Compute a confusion matrix.
wf_fit_03 |>
  predict(new_data = testing(roomba_split)) |>
  bind_cols(testing(roomba_split)) |>
  conf_mat(truth = segment, estimate = .pred_class)
##              Truth
## Prediction    considering own shopping
##   considering          58  10       16
##   own                   0   0        0
##   shopping              0   0        0
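Setting tree_depth = 2 by hand didn’t help. Rather than guessing values, we can let tidymodels search over both hyperparameters with cross-validation. A sketch (the fold count v = 5 and grid = 10 are arbitrary choices, and tune(), vfold_cv(), and tune_grid() come from the tune and rsample packages loaded with tidymodels):

```r
# Mark both hyperparameters for tuning.
tune_model <- decision_tree(tree_depth = tune(), min_n = tune()) |> 
  set_engine(engine = "rpart") |> 
  set_mode("classification")

tune_wf <- roomba_wf_03 |> 
  update_model(tune_model)

# Cross-validation folds from the training data, again stratified by segment.
roomba_folds <- vfold_cv(training(roomba_split), v = 5, strata = segment)

# Try 10 candidate hyperparameter combinations across the folds.
tune_fit <- tune_grid(tune_wf, resamples = roomba_folds, grid = 10)

# Which combinations did best on accuracy?
show_best(tune_fit, metric = "accuracy")
```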

Wrapping Up

Summary

  • Demonstrated splitting data by strata.
  • Discussed decision trees.
  • Walked through building, using, and updating workflows.
  • Tried a little hyperparameter tuning.

Next Time

  • Why have one decision tree when you can have a forest?

Supplementary Material

  • Tidy Modeling with R Chapter 7

Artwork by @allison_horst

Exercise 16

  1. Use case_when() to combine the own and shopping segments into a single category.
  2. Use workflows to combine the cleaning attitudes and demographics we’ve used in class and fit both a logistic regression and a decision tree. Which one has better predictive fit?
  3. Try and tune the decision tree hyperparameters to improve its predictive fit.
  4. Render the Quarto document into Word and upload to Canvas.